Options in Scan Processing for Shared-Disk Parallel Database Systems
Shared-disk database systems offer a high degree of freedom in the allocation of workload compared to shared-nothing architectures. This creates great potential for load balancing but also introduces additional complexity into the process of query scheduling. This report surveys the problems and opportunities faced in scan processing in a shared-disk environment. We list the parameters to tune and the decisions to make, as well as some known solutions and commonsense considerations, in order to identify the most promising areas of future research.
Disk Scheduling for Intermediate Results of Large Join Queries in Shared-Disk Parallel Database Systems
In shared-disk database systems, disk access has to be scheduled properly to avoid unnecessary contention between processors. The first part of this report studies the allocation of intermediate results of join queries (buckets) on disk and derives heuristics to determine the number of processing nodes and disks to employ. Using an analytical model, we show that declustering should be applied even for single buckets to ensure optimal performance. In the second part, we consider the order of reading the buckets and demonstrate the necessity of highly dynamic load balancing to prevent excessive disk contention, especially under skew conditions.
On Parallel Join Processing in Object-Relational Database Systems
So far, only a few performance studies on parallel object-relational database systems are available. In particular, the relative performance of relational vs. reference-based join processing in a parallel environment has not been investigated sufficiently. We present a performance study based on the BUCKY benchmark to compare parallel join processing using reference attributes with relational hash- and merge-join algorithms. In addition, we propose a data allocation scheme especially suited for object hierarchies and set-valued attributes.
On Disk Allocation of Intermediate Query Results in Parallel Database Systems
For complex queries in parallel database systems, substantial amounts of data must be redistributed between operators executed on different processing nodes. Frequently, such intermediate results cannot be held in main memory and must be stored on disk. To limit the ensuing performance penalty, a data allocation must be found that supports parallel I/O to the greatest possible extent.
In this paper, we propose declustering even self-contained units of temporary data processed in a single operation (such as individual buckets of parallel hash joins) across multiple disks. Using a suitable analytical model, we find that the improvement of parallel I/O outweighs the penalty of increased fragmentation.
Skew-Insensitive Join Processing in Shared-Disk Database Systems
Skew effects are still a significant problem for efficient query processing in parallel database systems. Especially in shared-nothing environments, this problem is aggravated by the substantial cost of data redistribution. Shared-disk systems, on the other hand, promise much higher flexibility in the distribution of workload among processing nodes because all input data can be accessed by any node at equal cost. In order to verify this potential for dynamic load balancing, we have devised a new technique for skew-tolerant join processing. In contrast to conventional solutions, our algorithm is not restricted to estimating processing costs in advance and assigning tasks to nodes accordingly. Instead, it monitors the actual progression of work and dynamically allocates tasks to processors, thus capitalizing on the uniform access path length in shared-disk architectures. This approach has the potential to alleviate not only any kind of data-inherent skew, but also execution skew caused by query-external workloads, by disk contention, or simply by inaccurate estimates used in predictive scheduling. We employ a detailed simulation system to evaluate the new algorithm under different types and degrees of skew.
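The contrast drawn in this abstract between predictive assignment and dynamic allocation can be illustrated with a toy simulation. The sketch below is not the authors' algorithm; the function names, the greedy predictive baseline, and the cost figures are assumptions chosen only to show why pulling tasks at run time tolerates a wrong estimate.

```python
import heapq

def static_makespan(estimated, actual, n_nodes):
    """Assign every task up front from estimated costs, then run with actual costs."""
    est_load = [0.0] * n_nodes
    act_load = [0.0] * n_nodes
    # Largest estimated task first, each to the node with the least estimated load.
    for task in sorted(range(len(estimated)), key=lambda i: -estimated[i]):
        node = min(range(n_nodes), key=lambda n: est_load[n])
        est_load[node] += estimated[task]
        act_load[node] += actual[task]
    return max(act_load)

def dynamic_makespan(actual, n_nodes):
    """Each node takes the next task from a shared queue as soon as it is idle."""
    free_at = [0.0] * n_nodes  # time at which each node becomes idle (a valid min-heap)
    for cost in actual:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + cost)
    return max(free_at)

# Hypothetical workload: all tasks are estimated as equal, but one is skewed.
estimated = [1.0] * 8
actual = [8.0] + [1.0] * 7
print(static_makespan(estimated, actual, 2))   # 11.0
print(dynamic_makespan(actual, 2))             # 8.0
```

In this made-up case the predictive plan keeps feeding tasks to the node stuck on the skewed task, while the dynamic variant lets the other node absorb the remaining work.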
A Classification of Skew Effects in Parallel Database Systems
Skew effects are a serious problem in parallel database systems, but the relationship between different skew types and load balancing methods is still not fully understood. We develop and compare two classifications of skew effects and load balancing strategies, respectively, to match their relevant properties.
Our conclusions highlight the importance of highly dynamic scheduling to optimize both the complexity and the success of load balancing. We also suggest the tuning of database schemata as a new anti-skew measure.
Skew-tolerantes, dynamisches LPT-Scheduling zur Join-Verarbeitung in parallelen Shared-Disk-Datenbanksystemen (Skew-Tolerant, Dynamic LPT Scheduling for Join Processing in Parallel Shared-Disk Database Systems)
In parallel databases used for decision-support tasks such as data warehousing, high throughput, short response times, and therefore also load balancing issues play a decisive role. This holds in particular for complex operations such as the relational join. The biggest problem in its parallel execution is non-uniform data and value distributions (skew), which are predictable only to a limited extent and therefore have to be handled at run time. In the widespread shared-nothing architectures, however, this is difficult to achieve because data redistribution incurs high overhead. We therefore propose a dynamic load balancing method based on a shared-disk architecture which, owing to the uniform access structure, works far more efficiently than is possible in shared-nothing systems. In a simulation study, it proves clearly superior to a conventional predictive algorithm.
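For readers unfamiliar with the LPT (longest processing time first) heuristic named in the title, a minimal sketch of the classic, non-dynamic form follows. It assumes task costs are known in advance; how the paper makes the heuristic dynamic is not described in the abstract, so the function name and the example costs here are illustrative only.

```python
import heapq

def lpt_schedule(costs, n_nodes):
    """Return (makespan, tasks per node) for classic LPT list scheduling."""
    heap = [(0.0, node) for node in range(n_nodes)]  # (current load, node id)
    assigned = [[] for _ in range(n_nodes)]
    # Handle the most expensive tasks first; each goes to the least-loaded node.
    for cost in sorted(costs, reverse=True):
        load, node = heapq.heappop(heap)
        assigned[node].append(cost)
        heapq.heappush(heap, (load + cost, node))
    return max(load for load, _ in heap), assigned

makespan, plan = lpt_schedule([2, 5, 1, 3, 3, 2], 2)
print(makespan)  # 8.0
print(plan)      # [[5, 2, 1], [3, 3, 2]]
```

Placing large tasks first leaves the small ones to even out the remaining imbalance, which is why the heuristic is a common starting point for load balancing.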
Dynamic Query Scheduling in Parallel Data Warehouses
Data warehouse queries pose challenging performance problems that often necessitate the use of parallel database systems (PDBS). Although dynamic load balancing is of key importance in PDBS, to our knowledge it has not yet been investigated thoroughly for parallel data warehouses.
In this study, we propose a scheduling strategy that simultaneously considers both processors and disks while utilizing the load balancing potential of a shared-disk architecture. We compare the performance of this new method to several other approaches in a comprehensive simulation study, incorporating skew aspects and typical data warehouse features such as star schemas.
Multi-Dimensional Database Allocation for Parallel Data Warehouses
Data allocation is a key performance factor for parallel database systems (PDBS). This holds especially for data warehousing environments where huge amounts of data and complex analytical queries have to be dealt with. While there are several studies on data allocation for relational PDBS, the specific requirements of data warehouses have not yet been sufficiently addressed. In this study, we consider the allocation of relational data warehouses based on a star schema and utilizing bitmap index structures. We investigate how a multi-dimensional hierarchical data fragmentation of the fact table supports queries referencing different subsets of the schema dimensions. Our analysis is based on realistic parameters derived from a decision support benchmark. The performance implications of different allocation choices are evaluated by means of a detailed simulation model.
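The idea that a multi-dimensional fragmentation serves queries on different subsets of the dimensions can be sketched as follows. This is a simplified illustration and not the scheme evaluated in the study: the dimensions `month` and `product_group`, their values, and the function name are hypothetical, and dimension hierarchies and bitmap indexes are left out.

```python
from itertools import product

def fragments_for_query(dim_values, predicate):
    """List the fact-table fragments a query has to read.

    dim_values: dimension name -> list of fragmentation values
    predicate:  dimension name -> set of selected values (unlisted = unrestricted)
    """
    axes = []
    for dim, values in dim_values.items():
        selected = predicate.get(dim)
        axes.append([v for v in values if selected is None or v in selected])
    # One fragment per combination of the remaining values in each dimension.
    return list(product(*axes))

# Hypothetical fragmentation: 3 months x 2 product groups = 6 fragments.
dims = {"month": ["2000-01", "2000-02", "2000-03"], "product_group": ["A", "B"]}
print(fragments_for_query(dims, {"month": {"2000-02"}}))
# [('2000-02', 'A'), ('2000-02', 'B')]
print(len(fragments_for_query(dims, {"product_group": {"A"}})))  # 3
print(len(fragments_for_query(dims, {})))                        # 6
```

A query that restricts either dimension skips part of the fact table, whereas a fragmentation on a single dimension would only help queries that restrict that one dimension.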